Abstract

We consider distributed optimization by a collection of nodes, each having access
to its own convex function, whose collective goal is to minimize the sum of the
functions. The communications between nodes are described by a time-varying sequence
of directed graphs, which is uniformly strongly connected. For such communications,
assuming that every node knows its out-degree, we develop a broadcast-based
algorithm, termed the subgradient-push, which steers every node to an optimal value under
a standard assumption of subgradient boundedness. The subgradient-push requires no
knowledge of either the number of agents or the graph sequence to implement. Our
analysis shows that the subgradient-push algorithm converges at a rate of $O(\ln t/\sqrt{t})$,
where the constant depends on the initial values at the nodes, the subgradient norms,
and, more interestingly, on both the consensus speed and the imbalances of influence
among the nodes.

1 Introduction

We consider the problem of distributed convex optimization by a network of nodes when
knowledge of the objective function is scattered throughout the network and unavailable at
any single location. There has been much recent interest in multi-agent optimization problems
of this type that arise whenever a large collection of nodes (which may be processors, nodes
of a sensor network, vehicles, or UAVs) desires to collectively optimize a global objective by
means of local actions taken by each node without any centralized coordination.

Specifically, we will study the problem of optimizing a sum of n convex functions by 
a network of n nodes when each function is known to only a single node. This problem 
frequently arises when control and signal processing protocols need to be implemented in 
sensor networks. For example, problems including robust statistical inference [22],
formation control [20], non-autonomous power control [25], distributed message routing
and spectrum access coordination [10] can be reduced to variations of this problem. We will
be focusing on the case when communication between nodes is directed and time-varying.

Distributed optimization of a sum of convex functions has received a surge of interest 
in recent years [I71|22l[Hl[l3l[IIlll2l[27l[T5l|3l[2ll[5]. There is now a considerable theory 
justifying the use of distributed subgradient methods in this setting, and their performance 
limitations and convergence times are well-understood. Moreover, distributed subgradient 
methods have been used to propose new solutions for a number of problems in distributed 
control and sensor networks [251 CHI [10] • However, the works cited above assumed commu- 
nications among nodes are either fixed or undirected. 

Our paper is the first to demonstrate a working subgradient protocol in the setting of 
directed time-varying communications. We develop a broadcast-based protocol, termed the 
subgradient-push, which steers every node to an optimal value under a standard assumption of 
subgradient boundedness. The subgradient-push requires each node to know its out-degree at
all times, but beyond this it needs no knowledge of the graph sequence or even of the number
of agents to implement. Our results show that it converges at a rate of $O(\ln t/\sqrt{t})$, where
the constant depends, among other factors, on the consensus speed of the corresponding 
directed graph sequence and a measure of the imbalance of influence among the nodes. 

Our work is closest to the recent papers [281 IMl E]. The papers [281 ISSj proved the 
convergence of a subgradient algorithm in a directed but fixed topology; implementation 
of the protocol appears to require knowledge of the graph or of the number of agents. By 
contrast, our results work in time-varying networks and are fully distributed, requiring no
knowledge of either the graph sequence or the number of agents. The paper [6] shows the 
convergence of a distributed optimization protocol in continuous time, also for directed but 
fixed graphs; moreover, an additional assumption is made in [6] that the graph is "balanced." 

All the prior work in distributed optimization, except for [281 EH] requires time- varying 
communications with some form of balancedness, often reflected in a requirement of having 
a sequence of doubly stochastic matrices that are commensurate with the sequence of un- 
derlying communication graphs. In contrast, our proposed method removes the need for the 
doubly stochastic matrices. The proposed distributed optimization model is motivated by 
applications that are characterized by time-varying directed communications such as those 
arising in a mobile sensor network communication where the links between nodes will come 
and go as nodes move in and out of line-of-sight or broadcast range of each other. More- 
over, if different nodes are capable of broadcasting messages at different power levels, then 
communication links connecting the nodes will necessarily be unidirectional. 

The remainder of this paper is organized as follows. We begin in Section 2, where we
describe the problem of interest, outline the subgradient-push algorithm, and state the main
convergence results. Section 3 is devoted to the proof of a key lemma, namely the convergence
rate result for a perturbed version of the so-called push-sum protocol; this lemma is then
used in the subsequent proofs of convergence and convergence rate for the subgradient-push
in Section 4. Finally, some conclusions are offered in Section 5.

Notation: We will apply boldface to distinguish between the vectors in $\mathbb{R}^d$ and scalars
associated with different nodes. For example, the vector $\mathbf{x}_i(t)$ is in boldface to identify a
vector for node $i$, while the scalar $y_i(t)$ is not, which identifies a scalar value for node $i$.
Additionally, for a vector $\mathbf{x}_i$ that has a subscript $i$ identifying a node index, we will use $[\mathbf{x}_i]_j$
to denote its $j$'th entry. Vectors such as $y(t) \in \mathbb{R}^n$, obtained by stacking scalar values
$y_i(t)$ associated with the nodes, are not bolded. For a matrix $A$, we will use $[A]_{ij}$ to denote
the $i,j$'th entry of $A$. Vectors are seen as column vectors unless otherwise explicitly
stated. We use $\mathbf{1}$ to denote the vector of ones, and $\|\mathbf{y}\|$ for the Euclidean norm of a vector $\mathbf{y}$.

2 Problem, Algorithm and Main Results 

We consider a network of $n$ nodes whose goal is to minimize the function

\[ F(\mathbf{z}) = \sum_{i=1}^{n} f_i(\mathbf{z}), \]

where only node $i$ knows the convex function $f_i(\mathbf{z}) : \mathbb{R}^d \to \mathbb{R}$. Under the assumption that
the set of optimal solutions $Z^* = \arg\min_{\mathbf{z} \in \mathbb{R}^d} F(\mathbf{z})$ is nonempty, we would like to design a
protocol in which all agents will maintain variables $\mathbf{z}_i(t)$ such that all the $\mathbf{z}_i(t)$ converge to
the same point in $Z^*$.

We will assume that, at each time $t$, node $i$ can only send messages to its out-neighbors in
some directed graph $G(t)$. Naturally, the graph $G(t)$ will have vertex set $\{1, \ldots, n\}$, and we
will use $E(t)$ to denote its edge set. Also, naturally, the sequence $\{G(t)\}$ should possess some
good long-term connectivity properties. A standard assumption, which we will be making,
is that the sequence $\{G(t)\}$ is uniformly strongly connected (or, as it is sometimes called,
$B$-strongly-connected), namely, that there exists some integer $B > 0$ (possibly unknown to
the nodes) such that the graph with edge set

\[ E_B(k) = \bigcup_{i=kB}^{(k+1)B-1} E(i) \]

is strongly connected for every $k \ge 0$. This is a typical assumption for many results in
multi-agent control: it is considerably weaker than requiring each $G(t)$ be connected, for it allows
the edges necessary for connectivity to appear over a long time period and in arbitrary order;
however, it is still strong enough to derive bounds on the speed of information propagation
from one part of the network to another.
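
The windowed connectivity condition above can be checked mechanically on a finite prefix of a graph sequence. The following is a minimal sketch, not part of the protocol itself: the function names, the 0-based node labels, and the example sequence are all assumptions made for illustration.

```python
def is_strongly_connected(n, edges):
    """True if the directed graph on nodes 0..n-1 with the given edges is strongly connected."""
    fwd = [[] for _ in range(n)]
    rev = [[] for _ in range(n)]
    for u, v in edges:
        fwd[u].append(v)
        rev[v].append(u)

    def reaches_all(adj):
        seen, stack = {0}, [0]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        return len(seen) == n

    # Node 0 must reach every node, and every node must reach node 0.
    return reaches_all(fwd) and reaches_all(rev)


def windows_strongly_connected(n, edge_sets, B):
    """Check every complete window E_B(k) = E(kB) U ... U E((k+1)B - 1) in a finite prefix."""
    for k in range(len(edge_sets) // B):
        union = set()
        for t in range(k * B, (k + 1) * B):
            union |= set(edge_sets[t])
        if not is_strongly_connected(n, union):
            return False
    return True


# Hypothetical 3-node sequence: no single graph is strongly connected,
# but every two consecutive graphs together form the cycle 0 -> 1 -> 2 -> 0.
sequence = [[(0, 1)], [(1, 2), (2, 0)]] * 3
print(windows_strongly_connected(3, sequence, B=2))
print(windows_strongly_connected(3, sequence, B=1))
```

A finite check of this kind can only refute the assumption on the observed prefix; the assumption itself concerns the whole infinite sequence.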

Finally, we introduce the notation $N_i^{\rm in}(t)$ and $N_i^{\rm out}(t)$ for the in- and out-neighborhoods
of node $i$, respectively, at time $t$. We will allow these neighborhoods to include the node $i$
itself;^1 formally, we have

\[ N_i^{\rm in}(t) = \{ j \mid (j,i) \in E(t) \} \cup \{i\}, \]
\[ N_i^{\rm out}(t) = \{ j \mid (i,j) \in E(t) \} \cup \{i\}, \]

and $d_i(t)$ for the out-degree of node $i$, i.e.,

\[ d_i(t) = |N_i^{\rm out}(t)|. \]

^1 Alternatively, one may define these neighborhoods in the standard way of graph theory, but require
that each graph in the sequence $\{G(t)\}$ has a self-loop at every node.

Crucially, we will be assuming that every node $i$ knows its out-degree $d_i(t)$ at every time $t$.

Our main result is a protocol which successfully accomplishes the task of distributed
minimization of F{z) under the assumptions we have laid out above. Our scheme is a 
combination of subgradient descent and the so-called push-sum protocol, recently studied in 
the papers P, HI E] . We will refer to our protocol as the subgradient-push method. 



2.1 The subgradient-push method 

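Every node $i$ will maintain auxiliary vector variables $\mathbf{x}_i(t)$, $\mathbf{w}_i(t)$ in $\mathbb{R}^d$, as well as an auxiliary
scalar variable $y_i(t)$, initialized as $y_i(0) = 1$ for all $i$. These quantities will be updated by
the nodes according to the rules

\[
\begin{aligned}
\mathbf{w}_i(t+1) &= \sum_{j \in N_i^{\rm in}(t)} \frac{\mathbf{x}_j(t)}{d_j(t)}, \\
y_i(t+1) &= \sum_{j \in N_i^{\rm in}(t)} \frac{y_j(t)}{d_j(t)}, \\
\mathbf{z}_i(t+1) &= \frac{\mathbf{w}_i(t+1)}{y_i(t+1)}, \\
\mathbf{x}_i(t+1) &= \mathbf{w}_i(t+1) - \alpha(t+1)\,\mathbf{g}_i(t+1),
\end{aligned}
\qquad (1)
\]

where $\mathbf{g}_i(t+1)$ is a subgradient of the function $f_i$ at $\mathbf{z}_i(t+1)$. The method is initiated with
$\mathbf{w}_i(0) = \mathbf{z}_i(0) = \mathbf{1}$ and $y_i(0) = 1$ for all $i$. The stepsize $\alpha(t+1) > 0$ satisfies the following
decay conditions

\[
\sum_{t=1}^{\infty} \alpha(t) = \infty, \qquad \sum_{t=1}^{\infty} \alpha^2(t) < \infty, \qquad \alpha(t) \le \alpha(s) \ \text{ for all } t > s \ge 0. \qquad (2)
\]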
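
As a quick check of the conditions in Eq. (2), the following worked example considers polynomially decaying stepsizes. The exponent $p$ is a hypothetical parameter introduced only for this illustration.

```latex
% Illustrative family: \alpha(t) = t^{-p} for t \ge 1 (the exponent p is an assumption of this sketch).
\[
\sum_{t=1}^{\infty} t^{-p} = \infty \iff p \le 1,
\qquad
\sum_{t=1}^{\infty} t^{-2p} < \infty \iff p > \tfrac{1}{2}.
\]
% Hence \alpha(t) = t^{-p} is non-increasing and satisfies both summation conditions
% exactly when 1/2 < p \le 1, for example \alpha(t) = 1/t.
% The boundary case p = 1/2 fails the square-summability condition.
```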

We note that the above equations have a simple broadcast-based implementation: each node $i$
broadcasts the quantities $\mathbf{x}_i(t)/d_i(t)$, $y_i(t)/d_i(t)$ to all of the nodes in its out-neighborhood,^2
which simply sum all the messages they receive to obtain $\mathbf{w}_i(t+1)$ and $y_i(t+1)$. The update
equations for $\mathbf{z}_i(t+1)$, $\mathbf{x}_i(t+1)$ can then be executed without any further communications
between nodes during step $t$.

Without the subgradient term in the final equation, our protocol would be a version of the 
push-sum protocol |9] for average computation studied recently in [U H] . For intuition on the 
precise form of these equations, we refer the reader to these three papers; roughly speaking, 
the somewhat involved form of the updates is intended to ensure that every node receives an 
equal weighting after all the linear combinations and ratios have been taken. In this case, the
vectors $\mathbf{z}_i(t+1)$ converge to some common point, i.e., a consensus is achieved. The inclusion
of the subgradient terms in the updates of $\mathbf{x}_i(t+1)$ is intended to steer the consensus point
towards the optimal set $Z^*$, while the push-sum updates steer the vectors $\mathbf{z}_i(t+1)$ towards
each other. Our main results, which we describe in the next section, demonstrate that this
scheme succeeds in steering all vectors $\mathbf{z}_i(t+1)$ towards the same point in the solution set $Z^*$.
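
To make the updates concrete, here is a minimal single-process simulation of the subgradient-push rules in the scalar case. Everything specific in it is an assumption made for illustration: the objective $f_i(z) = |z - c_i|$ (whose sum is minimized at the median of the $c_i$), the stepsize $\alpha(t) = t^{-3/4}$, the arbitrary initial values, the 0-based node labels, and the two alternating graphs.

```python
def subgradient_push(centers, x0, graphs, steps, p=0.75):
    """Simulate the subgradient-push updates for scalar f_i(z) = |z - centers[i]|.

    graphs[k][j] lists the out-neighbors of node j (self-loop included);
    graph number t % len(graphs) is used at time t.
    Returns the final ratios z_i and push-sum weights y_i.
    """
    n = len(centers)
    x = list(x0)
    y = [1.0] * n
    z = list(x0)
    for t in range(steps):
        out = graphs[t % len(graphs)]
        w = [0.0] * n
        y_new = [0.0] * n
        for j in range(n):
            d_j = len(out[j])            # out-degree of node j, counting the self-loop
            for i in out[j]:             # node j broadcasts x_j/d_j and y_j/d_j
                w[i] += x[j] / d_j
                y_new[i] += y[j] / d_j
        y = y_new
        z = [w[i] / y[i] for i in range(n)]
        alpha = 1.0 / (t + 1) ** p       # alpha(t+1); assumed stepsize choice
        # A subgradient of |z - c| is sign(z - c) (0 at the kink).
        g = [(z[i] > centers[i]) - (z[i] < centers[i]) for i in range(n)]
        x = [w[i] - alpha * g[i] for i in range(n)]
    return z, y


# Two alternating graphs whose union is the directed cycle 0 -> 1 -> 2 -> 0.
graphs = [
    [[0, 1], [1, 2], [2]],   # even times: edges 0->1 and 1->2, plus self-loops
    [[0], [1], [2, 0]],      # odd times: edge 2->0, plus self-loops
]
z, y = subgradient_push([1.0, 2.0, 6.0], [0.0, 5.0, -3.0], graphs, steps=20000)
print(z, y)
```

In this sketch no node uses the number of nodes or the future graphs; each broadcast only needs the sender's current out-degree, which mirrors the information assumptions stated above.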



^2 We note that we make use here of the assumption that node $i$ knows its out-degree $d_i(t)$.

2.2 Our results



Our first theorem demonstrates the correctness of the subgradient-push method for an
arbitrary stepsize $\alpha(t)$ satisfying Eq. (2); this holds under the assumptions we have laid out
above, as well as an additional technical assumption on the boundedness of the subgradients.

Theorem 1 Suppose that:

(a) The graph sequence $\{G(t)\}$ is uniformly strongly connected with a self-loop at every
node.

(b) Each function $f_i(\mathbf{z})$ is convex and the set $Z^* = \arg\min_{\mathbf{z} \in \mathbb{R}^d} F(\mathbf{z})$ is nonempty.

(c) The subgradients of each $f_i(\mathbf{z})$ are uniformly bounded, i.e., there exists $L_i < \infty$ such
that

\[ \|\mathbf{g}_i\|_2 \le L_i \]

for all subgradients $\mathbf{g}_i$ of $f_i(\mathbf{z})$ at all points $\mathbf{z} \in \mathbb{R}^d$.

Then, the distributed subgradient-push method of Eq. (1) with the stepsize satisfying the
conditions in Eq. (2) has the following property

\[ \lim_{t \to \infty} \mathbf{z}_i(t) = \mathbf{z}^* \ \text{ for all } i \text{ and for some } \mathbf{z}^* \in Z^*. \]

Our second theorem makes explicit the rate at which the objective function converges to
its optimal value. As is standard with subgradient methods, we will make two tweaks in order
to get a convergence rate result: (i) we take a stepsize which decays as $\alpha(t) = 1/\sqrt{t}$ (stepsizes
which decay at faster rates usually produce inferior convergence rates), and (ii) each node $i$
will maintain a convex combination of the values $\mathbf{z}_i(1), \mathbf{z}_i(2), \ldots$ for which the convergence
rate will be obtained. We then demonstrate that the subgradient-push converges at a rate
of $O(\ln t/\sqrt{t})$; this is formally stated in the following theorem. The theorem makes use of
the matrix $A(t)$ that captures the weights used in the construction of $\mathbf{w}_i(t+1)$ and $y_i(t+1)$
in Eq. (1), which are defined by

\[
[A(t)]_{ij} =
\begin{cases}
1/d_j(t) & \text{whenever } j \in N_i^{\rm in}(t), \\
0 & \text{otherwise.}
\end{cases}
\qquad (3)
\]

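Theorem 2 Suppose all the assumptions of Theorem 1 hold and, additionally, $\alpha(t) = 1/\sqrt{t}$
for $t \ge 1$. Moreover, suppose that every node $i$ maintains the variable $\tilde{\mathbf{z}}_i(t) \in \mathbb{R}^d$
initialized at time $t = 1$ to $\tilde{\mathbf{z}}_i(1) = \mathbf{z}_i(1)$ and updated as

\[ \tilde{\mathbf{z}}_i(t+1) = \frac{\alpha(t+1)\,\mathbf{z}_i(t+1) + S(t)\,\tilde{\mathbf{z}}_i(t)}{S(t+1)}, \]

where $S(t) = \sum_{s=0}^{t-1} \alpha(s+1)$. Then, we have that for all $t \ge 1$, $i = 1, \ldots, n$, and any $\mathbf{z}^* \in Z^*$,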
F{z{t))-F{z*) < !^l|x(0)-z1|i^n(Er=i^.f (l + lnt) 




5(1 - A) 
where

\[ \bar{\mathbf{x}}(0) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_i(0), \]

and the scalars $\lambda$ and $\delta$ are functions of the graph sequence $G(1), G(2), \ldots$, which have the
following properties:

(a) For any $B$-connected graph sequence with a self-loop at every node,

\[ \delta \ge \frac{1}{n^{nB}}, \qquad \lambda \le \left( 1 - \frac{1}{n^{nB}} \right)^{1/(nB)}. \]

(b) If each of the graphs $G(t)$ is regular,^3 then

\[ \delta = 1, \qquad \lambda \le \min \left\{ \left( 1 - \frac{1}{4n^3} \right)^{1/B}, \ \max_{t \ge 1} \sqrt{\sigma_2(A(t))} \right\}, \]

where $A(t)$ is defined by Eq. (3) and $\sigma_2(A)$ is the second-largest singular value of a
matrix $A$.
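
The recursively maintained variable in Theorem 2 is a weighted running average of the iterates with weights $\alpha(s)$. The sketch below, for scalar iterates and with hypothetical function names, checks that the recursion agrees with the direct weighted average $\sum_{s=1}^{t} \alpha(s) z(s) / S(t)$; the input list is made-up illustrative data.

```python
import math


def running_average(zs):
    """zs[k] plays the role of z_i(k+1); returns the recursively updated average."""
    S = 1.0 / math.sqrt(1)          # S(1) = alpha(1)
    z_tilde = zs[0]                 # initialization at time t = 1
    for t in range(1, len(zs)):
        a = 1.0 / math.sqrt(t + 1)  # alpha(t+1)
        S_next = S + a              # S(t+1) = S(t) + alpha(t+1)
        z_tilde = (a * zs[t] + S * z_tilde) / S_next
        S = S_next
    return z_tilde


def direct_average(zs):
    """Weighted average of z(1..t) with weights alpha(s) = 1/sqrt(s)."""
    weights = [1.0 / math.sqrt(s) for s in range(1, len(zs) + 1)]
    return sum(w * z for w, z in zip(weights, zs)) / sum(weights)


samples = [3.0, -1.0, 4.0, 1.5, 9.0]   # made-up iterates for illustration
print(running_average(samples), direct_average(samples))
```

The recursive form matters in practice because a node only stores the current average and the scalar $S(t)$, never the history of its iterates.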

Several features of this theorem are expected: it is standard for a distributed subgradient
method to converge at a rate of $O(\ln t/\sqrt{t})$ with the constant depending on the
subgradient-norm upper bounds $L_i$, as well as on the initial conditions $\mathbf{x}_i(0)$ [23], [5]. Moreover, it is also
standard for the analysis of these methods to involve $\lambda$, which is a measure of the connectivity
of the directed sequence $G(1), G(2), \ldots$; namely, the closeness of $\lambda$ to 1 measures the speed
at which a consensus process on the graph sequence $\{G(t)\}$ converges.

However, our bounds also include the parameter $\delta$, which, as we will later see, is a
measure of the imbalance of influences among the nodes. Time-varying directed regular
networks are uniform in influence and will have $\delta = 1$, so that $\delta$ will disappear from the
bounds entirely; however, networks which are, in a sense to be specified, non-uniform will
suffer a corresponding blow-up in the convergence time of the subgradient-push algorithm.

Moreover, we note that while the term $1/(\delta(1-\lambda))$ appearing in our bounds is bounded
only exponentially in $n$ in the worst case, it need not be this large for every graph
sequence; indeed, part (b) of Theorem 2 shows that for a class of time-varying directed
graphs, $1/(\delta(1-\lambda))$ scales polynomially in $n$. Our work therefore motivates the question of
obtaining effective bounds on consensus speed and imbalance of influence in sequences
of directed graphs. Finally, we remark that previous research [17], [23], [5] has studied the case
when the matrices $A(t)$ (defined in the statement of Theorem 2) are doubly stochastic; this
occurs when the directed graph sequence $\{G(t)\}$ is regular, and in that case our polynomial
bounds essentially match previously known results.

^3 The graph $G(t)$ is regular if there exists some $d(t)$ such that every out-degree and every in-degree of a
node in $G(t)$ equals $d(t)$.

3 Perturbed Push-Sum Protocol



This section is dedicated to the analysis of a perturbed version of the so-called push-sum
protocol, originally introduced in the groundbreaking work [9] and recently analyzed in 
time- varying directed graphs in [H H]. The push-sum is a protocol for node interaction in 
directed topologies which allows nodes to compute averages and other aggregates in spite of 
the one-way nature of the communication links. 

The original results of [HI [H H] demonstrate the convergence of the push-sum protocol. 
Here we will prove a generalization of this fact by showing that the protocol remains
convergent even if the state of the nodes is perturbed at each step, as long as the size of the
perturbations decays to zero. We will later use this result in the proof of Theorems 1 and 2;
because it has a self-contained interpretation and analysis, we sequester it to this section.

We begin with a statement of the perturbed push-sum update rule. Every node $i$
maintains scalar variables $x_i(t)$, $y_i(t)$, $z_i(t)$, $w_i(t)$, where we assume $y_i(0) = 1$ for all $i = 1, \ldots, n$.
These variables are updated as follows:

\[
\begin{aligned}
w_i(t+1) &= \sum_{j \in N_i^{\rm in}(t)} \frac{x_j(t)}{d_j(t)}, \\
y_i(t+1) &= \sum_{j \in N_i^{\rm in}(t)} \frac{y_j(t)}{d_j(t)}, \\
z_i(t+1) &= \frac{w_i(t+1)}{y_i(t+1)}, \\
x_i(t+1) &= w_i(t+1) + \epsilon_i(t+1),
\end{aligned}
\qquad (4)
\]

where $\epsilon_i(t)$ is a perturbation at every step, perhaps adversarially chosen. We assume that
$N_i^{\rm in}(t)$ is the in-neighborhood of node $i$ in a directed graph $G(t)$ and $d_j(t)$ is the out-degree
of node $j$, as previously defined in Section 2.

We note that without the perturbation term $\epsilon_i(t)$, the method in Eq. (4) reduces to the
push-sum protocol. Moreover, our proposed subgradient-push method of Eq. (1) is simply
Eq. (4) with a specific form for these perturbation vectors $\epsilon_i(t)$.

The precise form of the push-sum equations of Eq. (4) is a little involved. These
dynamics were introduced for the purpose of average computation (in the case when all the
perturbations $\epsilon_i(t)$ are zero) and have a simple motivating intuition. The push-sum is a
variation of a consensus-like protocol wherein every node updates its values by taking linear
combinations of the values of its neighbors; in such schemes, some nodes are bound to be
more influential than others (meaning that other nodes end up placing higher coefficients on
them), for example by virtue of being more centrally placed. The dynamics of the push-sum
are designed around the ratio $z_i(t) = w_i(t)/y_i(t)$ in which these imbalances of influence are
meant to be cancelled so that each $z_i(t)$ converges to $(1/n) \sum_{j=1}^{n} x_j(0)$. We refer the reader
to P m [1] for more details. 

We may rewrite the perturbed push-sum equations in more compact form. Using the
definition of $A(t)$ from Eq. (3), the relations in Eq. (4) assume the following form:

\[
\begin{aligned}
w(t+1) &= A(t)\,x(t), \\
y(t+1) &= A(t)\,y(t), \\
z_i(t+1) &= \frac{w_i(t+1)}{y_i(t+1)}, \\
x(t+1) &= w(t+1) + \epsilon(t+1),
\end{aligned}
\qquad (5)
\]

where $\epsilon(t) = (\epsilon_1(t), \ldots, \epsilon_n(t))'$. Observe that each of the matrices $A(t)$ is column-stochastic
but not necessarily row-stochastic.
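
The column-stochasticity claim is easy to see on a small instance. The following sketch builds the weight matrix of Eq. (3) from out-neighbor lists; the function name, the 0-based labels, and the 3-node graph are assumptions chosen for illustration.

```python
def weight_matrix(out_neighbors):
    """Build A with [A]_{ij} = 1/d_j when j is an in-neighbor of i, and 0 otherwise."""
    n = len(out_neighbors)
    A = [[0.0] * n for _ in range(n)]
    for j, outs in enumerate(out_neighbors):
        for i in outs:                 # j sends to i, so j is an in-neighbor of i
            A[i][j] = 1.0 / len(outs)  # len(outs) is the out-degree d_j (self-loop included)
    return A


# Hypothetical graph: node 0 sends to {0, 1, 2}, node 1 to {1, 2}, node 2 to {2, 0}.
out_neighbors = [[0, 1, 2], [1, 2], [2, 0]]
A = weight_matrix(out_neighbors)
col_sums = [sum(A[i][j] for i in range(3)) for j in range(3)]
row_sums = [sum(row) for row in A]
print(col_sums)
print(row_sums)
```

Each column sums to one because node $j$ splits its value equally among its $d_j$ out-neighbors; the rows need not sum to one because in-degrees and out-degrees can differ.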

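We will be concerned here with demonstrating a convergence result and a convergence
rate for the updates given in Eq. (4), or equivalently, in Eq. (5). Specifically, the bulk of this
section is dedicated to proving the following lemma.

Lemma 3 Consider the sequences $\{z_i(t)\}$, $i = 1, \ldots, n$, generated by the method in
Eq. (4). Assuming that the graph sequence $\{G(t)\}$ is uniformly strongly connected, the
following statements hold:

(a) There exists some $\delta > 0$ and $\lambda \in (0,1)$ such that for all $t \ge 1$ we have

\[ \left| z_i(t+1) - \frac{\mathbf{1}'x(t)}{n} \right| \le \frac{8}{\delta} \left( \lambda^t \|x(0)\|_1 + \sum_{s=1}^{t} \lambda^{t-s} \|\epsilon(s)\|_1 \right). \]

Moreover, we may choose $\delta, \lambda$ satisfying

\[ \delta \ge \frac{1}{n^{nB}}, \qquad \lambda \le \left( 1 - \frac{1}{n^{nB}} \right)^{1/(nB)}. \]

If in addition each of the matrices $A(t)$ is doubly stochastic, then

\[ \delta = 1, \qquad \lambda \le \min \left\{ \left( 1 - \frac{1}{4n^3} \right)^{1/B}, \ \max_{t \ge 1} \sqrt{\sigma_2(A(t))} \right\}. \]

(b) If $\lim_{t \to \infty} \epsilon_i(t) = 0$ for all $i = 1, \ldots, n$, then

\[ \lim_{t \to \infty} \left| z_i(t+1) - \frac{\mathbf{1}'x(t)}{n} \right| = 0 \quad \text{for all } i. \]

(c) If $\{\alpha(t)\}$ is a non-increasing positive scalar sequence such that $\sum_{t=1}^{\infty} \alpha(t) |\epsilon_i(t)| < \infty$
for all $i$, then

\[ \sum_{t=0}^{\infty} \alpha(t+1) \left| z_i(t+1) - \frac{\mathbf{1}'x(t)}{n} \right| < \infty \]

for all $i = 1, \ldots, n$.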
For part (a) of Lemma 3, observe that each of the matrices $A(t)$ is doubly stochastic if
each of the graphs $G(t)$ is regular. Furthermore, we observe that if $\epsilon_i(t) = 0$, this lemma
implies that the push-sum method converges at a geometric rate; moreover, it is easy to
see that $\mathbf{1}'x(t)/n = \mathbf{1}'x(0)/n$ and therefore $z_i(t) \to \mathbf{1}'x(0)/n$, so that the push-sum protocol
successfully computes the average. In the more general case when the perturbations are
nonzero, the lemma states that if these perturbations decay to zero, then the push-sum
method still converges. Of course, it will no longer be true in this case that the convergence
is necessarily to the average of the initial values.
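
The behavior described by Lemma 3 can be observed numerically. The sketch below simulates the scalar perturbed push-sum recursion and reports the tracking gap $\max_i |z_i(t+1) - \mathbf{1}'x(t)/n|$ at the last step. The function name, the graph sequence, the initial values, and the perturbation $\epsilon_i(t) = (i+1)/t$ are all hypothetical choices made for illustration.

```python
def perturbed_push_sum(x0, graphs, steps, eps=lambda i, t: 0.0):
    """Run the scalar perturbed push-sum; eps(i, t) is the perturbation of node i at time t.

    graphs[k][j] lists the out-neighbors of node j (self-loop included).
    Returns the final ratios z_i and the final gap max_i |z_i(t+1) - mean(x(t))|.
    """
    n = len(x0)
    x, y = list(x0), [1.0] * n
    z, gap = list(x0), 0.0
    for t in range(steps):
        out = graphs[t % len(graphs)]
        avg = sum(x) / n                          # the tracked quantity 1'x(t)/n
        w, y_new = [0.0] * n, [0.0] * n
        for j in range(n):
            d_j = len(out[j])
            for i in out[j]:
                w[i] += x[j] / d_j
                y_new[i] += y[j] / d_j
        y = y_new
        z = [w[i] / y[i] for i in range(n)]       # z_i(t+1)
        gap = max(abs(z[i] - avg) for i in range(n))
        x = [w[i] + eps(i, t + 1) for i in range(n)]
    return z, gap


graphs = [
    [[0, 1], [1, 2], [2]],   # even times: edges 0->1 and 1->2, plus self-loops
    [[0], [1], [2, 0]],      # odd times: edge 2->0, plus self-loops
]
x0 = [0.0, 5.0, -3.0]
z_clean, gap_clean = perturbed_push_sum(x0, graphs, steps=200)
z_pert, gap_pert = perturbed_push_sum(x0, graphs, steps=5000, eps=lambda i, t: (i + 1) / t)
print(z_clean, gap_clean, gap_pert)
```

Without perturbations every ratio approaches the initial average; with the decaying perturbation the ratios no longer settle at that average, yet the gap to the running average $\mathbf{1}'x(t)/n$ still shrinks.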

We will prove a series of auxiliary lemmas before beginning the proof of Lemma 3. We
first remark that the matrices $A(t)$ have a special structure that allows us to efficiently
analyze their products. Specifically, we have the following properties of the matrices $A(t)$
(see [2], [71 [HI [TH [30] for proofs of this and similar statements). 

Lemma 4 Suppose that the graph sequence $\{G(t)\}$ is uniformly strongly-connected. Then,
the following statements are true:

(a) For every $s \ge 0$, the limit $\lim_{t \to \infty} A'(t)A'(t-1) \cdots A'(s+1)A'(s)$ exists. In particular,
the limiting matrix is a rank-one stochastic matrix, i.e., there is a stochastic vector
$\phi(s)$ such that

\[ \lim_{t \to \infty} A'(t)A'(t-1) \cdots A'(s+1)A'(s) = \mathbf{1}\phi'(s) \quad \text{for all } s \ge 0. \]

(b) The convergence rate is geometric:

\[ \left| [A'(t)A'(t-1) \cdots A'(s+1)A'(s)]_{ij} - \phi_j(s) \right| \le C\lambda^{t-s} \quad \text{for all } i, j = 1, \ldots, n, \]

for some $C$ and $\lambda \in (0,1)$.

There are also known bounds on the parameters $C, \lambda$ from this lemma which upper bound
how large $C$ is and how far away $\lambda$ is from 1. Moreover, these bounds improve if the sequence
$\{G(t)\}$ has some nice properties. The following lemma is a formal statement to this effect.

Lemma 5 We have:

(a) For any $B$-strongly connected sequence of graphs, we may choose

\[ C = 2, \qquad \lambda = \left( 1 - \frac{1}{n^{nB}} \right)^{1/(nB)} \]

in the statement of Lemma 4.

(b) If in addition every graph is regular, we may choose

\[ C = \sqrt{2}, \qquad \lambda = \left( 1 - \frac{1}{4n^3} \right)^{1/B}, \]

or

\[ C = 1, \qquad \lambda = \max_{t \ge 1} \sqrt{\sigma_2(A(t))}. \]
Proof. From [2], [7], [30], under the assumption of $B$-strong connectivity and our definition of
neighborhoods, we have that if

\[ x(t) = A'(t-1) \cdots A'(s)\,x(s), \]

then

\[ \max_{i=1,\ldots,n} x_i(t) - \min_{i=1,\ldots,n} x_i(t) \le \left( 1 - \frac{1}{n^{nB}} \right)^{\lfloor (t-s)/(nB) \rfloor} \left( \max_{i=1,\ldots,n} x_i(s) - \min_{i=1,\ldots,n} x_i(s) \right). \]

We note that this implies

\[ \max_{i=1,\ldots,n} x_i(t) - \min_{i=1,\ldots,n} x_i(t) \le 2 \left( \left( 1 - \frac{1}{n^{nB}} \right)^{1/(nB)} \right)^{t-s} \left( \max_{i=1,\ldots,n} x_i(s) - \min_{i=1,\ldots,n} x_i(s) \right). \]

This holds for every $x(s)$. By choosing $x(s)$ to be each of the $n$ basis vectors, we see that
for every $j = 1, \ldots, n$,

\[ \max_{i} [A'(t) \cdots A'(s)]_{ij} - \min_{i} [A'(t) \cdots A'(s)]_{ij} \le 2 \left( \left( 1 - \frac{1}{n^{nB}} \right)^{1/(nB)} \right)^{t-s}. \]

Since $\phi_j(s)$ is a convex combination of the $n$ numbers $[A'(t) \cdots A'(s)]_{ij}$, $i = 1, \ldots, n$, we have
proven part (a).

As for the second statement, when each of the matrices A{t) is doubly stochastic, the 
results of [IE] imply that if 



x{t) = A'{t -!)■■■ A' {s)x{s), 

then 

[{t-s)/B] n 



where x is the average entry of x{s). Similarly, we write this as 

t-s 



(xi(s) - x)^ . 



Moreover, plugging in each basis vector, we obtain that for each j, 

t-s 



Since $\sqrt{1 - \beta/2} \le 1 - \beta/4$ for all $\beta \in (0,1)$, this implies that we may choose $C = \sqrt{2}$, $\lambda =
(1 - 1/(4n^3))^{1/B}$. The same line of argument shows that we may choose $C = 1$ and $\lambda =
\max_{t \ge 1} \sqrt{\sigma_2(A(t))}$. ■

The next lemma provides a bound on how small the entries of the products $A'(t) \cdots A'(1)$
can get. We will use these bounds later in the proof of Lemma HI 
Lemma 6 Given a graph sequence $\{G(t)\}$, define

\[ \delta = \inf_{t=1,2,\ldots} \ \min_{i=1,\ldots,n} [\mathbf{1}'A'(t) \cdots A'(1)]_i. \]

If $G(t)$ is $B$-strongly connected, then

\[ \delta \ge \frac{1}{n^{nB}}. \]

If each $G(t)$ is regular, then

\[ \delta = 1. \]

Proof. By the definition of matrices $A(t)$ in Eq. (3), we have that for all $t \ge 1$,

\[ [A'(t+1) \cdots A'(1)]_{ii} \ge \frac{1}{n} [A'(t) \cdots A'(1)]_{ii}, \]

and again due to the presence of self-loops, $[A'(1)]_{ii} \ge 1/n$ for all $i$. Thus, we certainly have
that $[\mathbf{1}'A'(t) \cdots A'(1)]_i \ge 1/n^{nB}$ for all $i$ and all $t$ in the range $1 \le t \le nB$. However, it was
shown in [7], [20] that for $t > (n-1)B$, every entry of $A'(t) \cdots A'(1)$ is positive and has value
at least $1/n^{nB}$. Since $nB > (n-1)B$, this proves the bound $\delta \ge 1/n^{nB}$. The final claim
that $\delta = 1$ for a sequence of regular graphs is trivial. ■

Remark 7 Observe that as an immediate consequence of the definition of $\delta$, we have
$\phi_j(s) \ge \delta/n$ for all $j = 1, \ldots, n$.

By taking transposes and applying the previous lemmas, we immediately get the following
result on the products $A(t) \cdots A(s)$; for convenience, let us adopt the notation of referring
to these products as $A(t : s)$.

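Corollary 8 Suppose that the graph sequence $\{G(t)\}$ is $B$-strongly-connected. Then:

(a) There is a sequence of stochastic vectors $\phi(t)$ such that the matrix difference
$A(t : s) - \phi(t)\mathbf{1}'$ for $t \ge s$ decays geometrically, i.e.,

\[ \left| [A(t : s)]_{ij} - \phi_i(t) \right| \le C\lambda^{t-s} \quad \text{for all } i, j = 1, \ldots, n, \]

where we can always choose

\[ C = 4, \qquad \lambda = \left( 1 - \frac{1}{n^{nB}} \right)^{1/(nB)}. \]

If in addition each $A(t)$ is doubly stochastic, we may choose

\[ C = 2\sqrt{2}, \qquad \lambda = \left( 1 - \frac{1}{4n^3} \right)^{1/B} \]

or

\[ C = 2, \qquad \lambda = \max_{t \ge 1} \sqrt{\sigma_2(A(t))}, \]

whenever the last quantity is below 1.

(b) The quantity

\[ \delta = \inf_{t=1,2,\ldots} \ \min_{i=1,\ldots,n} [\mathbf{1}'A'(t) \cdots A'(1)]_i \]

satisfies

\[ \delta \ge \frac{1}{n^{nB}}. \]

Moreover, if the graphs $G(t)$ are regular, we have $\delta = 1$. Furthermore, we have

\[ \phi_i(t) \ge \frac{\delta}{n} \quad \text{for all times } t. \]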

Proof. The corollary follows by taking transposes and applying Lemmas 4, 5, 6 as well as
Remark 7. The only thing that needs to be proved is that these lemmas can be applied,
which is routine, with the exception of the issue of $B$-connectivity, which requires further
elaboration.

When we transpose $A(t : s)$ to get the product $A'(s) \cdots A'(t-1)A'(t)$, we have reversed
the order in which the matrices appear; and, moreover, by taking the transposes of each
matrix, we have effectively reversed the direction of every edge in each $G(t)$. We must thus
argue that when we take an initial segment of a $B$-connected graph sequence and reverse the
order of the graphs as well as the direction of each edge, we still have an initial segment of
a $B$-connected sequence. Unfortunately, this is not true; however, it is easy to see that it
is true after we throw out at most $B$ graphs from the start of the sequence. The bound of
Lemma 4 part (b) thus applies with $t - s$ replaced by $t - s - B$, which we take care of by
doubling the constants $C$ instead. ■

We remark that $\delta$ as defined in Lemma 6 may be thought of as a measure of imbalance
of the influence among the nodes. Indeed, $\delta$ is defined as the best lower bound on the row
sums of the matrices $A(t : s)$. In the case when each of the graphs $G(t)$ is regular, the
matrices $A(t)$ will be doubly stochastic and we will have $\delta = 1$ as previously remarked (i.e.,
no imbalance of influence). By contrast, when $\delta \approx 0$, there is a node $i$ such that the $i$'th row
in some $A(t : s)$ will have entries which are all nearly zero; in short, it is almost as if node $i$
has no in-neighbors at all and it is uninfluenced by what occurs in the rest of the network.
In cases intermediate between these two extremes, $\delta$ reflects the existence of a node whose
influence on the rest of the network, measured by summing up all the weights placed on it
in $A(t : s)$, is small.
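
The row-sum interpretation of $\delta$ can be explored numerically: since $y(0) = \mathbf{1}$, the push-sum weights $y(t+1)$ are the row sums of the product $A(t) \cdots A(0)$. The sketch below tracks the smallest such row sum over a finite horizon, which only gives an upper estimate of $\delta$. The function name and both example graphs are assumptions made for illustration.

```python
def min_row_sum(graphs, steps):
    """Smallest entry of y(t+1) = A(t)...A(0) 1 over t = 0, ..., steps - 1.

    graphs[k][j] lists the out-neighbors of node j (self-loop included).
    """
    n = len(graphs[0])
    y = [1.0] * n
    smallest = float("inf")
    for t in range(steps):
        out = graphs[t % len(graphs)]
        y_new = [0.0] * n
        for j in range(n):
            for i in out[j]:
                y_new[i] += y[j] / len(out[j])
        y = y_new
        smallest = min(smallest, min(y))
    return smallest


regular = [[[0, 1], [1, 2], [2, 0]]]         # directed cycle with self-loops: all degrees equal 2
irregular = [[[0, 1, 2], [1, 2], [2, 0]]]    # node 0 has one extra out-edge
print(min_row_sum(regular, 50), min_row_sum(irregular, 50))
```

For the regular cycle every row sum stays equal to one, consistent with $\delta = 1$; in the irregular graph some row sums drop below one, reflecting unequal influence.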

We proceed with our sequence of intermediate lemmas for the proof of Lemma 3. We
will now need some auxiliary results on the convolution of two scalar sequences, as in the
following statement.



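Lemma 9 ([23] Lemma 3.1) Let $\{\gamma_k\}$ be a scalar sequence.

(a) If $\lim_{k \to \infty} \gamma_k = \gamma$ and $0 < \beta < 1$, then $\lim_{k \to \infty} \sum_{\ell=0}^{k} \beta^{k-\ell} \gamma_\ell = \frac{\gamma}{1-\beta}$.

(b) If $\gamma_k \ge 0$ for all $k$, $\sum_{k=0}^{\infty} \gamma_k < \infty$ and $0 < \beta < 1$, then $\sum_{k=0}^{\infty} \left( \sum_{\ell=0}^{k} \beta^{k-\ell} \gamma_\ell \right) < \infty$.

(c) If $\limsup_{k \to \infty} \gamma_k = \gamma$ and $\{\zeta_k\}$ is a positive scalar sequence with $\sum_{k=0}^{\infty} \zeta_k = \infty$, then
$\limsup_{K \to \infty} \frac{\sum_{k=0}^{K} \gamma_k \zeta_k}{\sum_{k=0}^{K} \zeta_k} \le \gamma$. In addition, if $\liminf_{k \to \infty} \gamma_k = \gamma$, then $\lim_{K \to \infty} \frac{\sum_{k=0}^{K} \gamma_k \zeta_k}{\sum_{k=0}^{K} \zeta_k} = \gamma$.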
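
Part (a) of Lemma 9 can be sanity-checked numerically. The sketch below evaluates the geometric convolution for a convergent sequence; the function name and the test sequence $\gamma_\ell = 1 + 1/(\ell+1)$ with $\beta = 1/2$ are hypothetical choices for illustration.

```python
def geometric_convolution(gamma, beta, k):
    """Return sum_{l=0}^{k} beta^(k-l) * gamma(l), accumulated Horner-style."""
    total = 0.0
    for l in range(k + 1):
        total = beta * total + gamma(l)
    return total


# gamma_l -> 1, so part (a) predicts a limit of 1 / (1 - 0.5) = 2.
value = geometric_convolution(lambda l: 1.0 + 1.0 / (l + 1), 0.5, 2000)
print(value)
```

The same convolution structure appears in the bound of Lemma 3(a), with $\lambda$ in place of $\beta$ and the perturbation norms in place of $\gamma_\ell$.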
With these pieces in place, we can now proceed to the proof of Lemma 3. Our argument
will rely on Corollary 8 on the products $A(t : s)$ and the just-stated Lemma 9 on sequence
convolutions.

Proof of Lemma 3. (a) By inspecting Eq. (5) it is easy to see that

\[ x(t+1) = A(t : 0)\,x(0) + \sum_{s=1}^{t} A(t : s)\,\epsilon(s) + \epsilon(t+1), \qquad (6) \]

which implies

\[ A(t+1)\,x(t+1) = A(t+1 : 0)\,x(0) + \sum_{s=1}^{t+1} A(t+1 : s)\,\epsilon(s). \qquad (7) \]

Moreover, since each $A(t)$ is column-stochastic, we have that $\mathbf{1}'A(t) = \mathbf{1}'$ and Eq. (6) also
implies that

\[ \mathbf{1}'x(t+1) = \mathbf{1}'x(0) + \sum_{s=1}^{t+1} \mathbf{1}'\epsilon(s). \qquad (8) \]

Now Eq. (7) and Eq. (8) give us that

\[ A(t+1)\,x(t+1) - \phi(t+1)\mathbf{1}'x(t+1) = \left( A(t+1 : 0) - \phi(t+1)\mathbf{1}' \right) x(0) + \sum_{s=1}^{t+1} \left( A(t+1 : s) - \phi(t+1)\mathbf{1}' \right) \epsilon(s). \qquad (9) \]

However, according to Corollary 8, if we define $D(t,s)$ to be

\[ D(t,s) = A(t : s) - \phi(t)\mathbf{1}', \]

then we have the entry-wise decay bound

\[ \left| [D(t,s)]_{ij} \right| \le C\lambda^{t-s} \quad \text{for all } i, j \text{ and } t \ge s \ge 0, \qquad (10) \]

for some constants $C > 0$ and $\lambda \in (0,1)$. Moreover, the constants $C$ and $\lambda$ have all properties
listed in Corollary 8.

Therefore, from relation (9) it follows for $t \ge 0$,

\[ A(t+1)\,x(t+1) = \phi(t+1)\mathbf{1}'x(t+1) + D(t+1, 0)\,x(0) + \sum_{s=1}^{t+1} D(t+1, s)\,\epsilon(s). \]

Thus, for $t \ge 1$ we have

\[ w(t+1) = A(t)\,x(t) = \phi(t)\mathbf{1}'x(t) + D(t, 0)\,x(0) + \sum_{s=1}^{t} D(t, s)\,\epsilon(s). \qquad (11) \]

We may derive a similar expression for $y(t+1)$:

\[ y(t+1) = A(t : 0)\,y(0) = \phi(t)\mathbf{1}'y(0) + D(t, 0)\,y(0) = \phi(t)\,n + D(t, 0)\mathbf{1}. \qquad (12) \]
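From (11) and (12) we obtain for every $t \ge 1$ and all $i$,

\[ z_i(t+1) = \frac{w_i(t+1)}{y_i(t+1)} = \frac{\phi_i(t)\mathbf{1}'x(t) + [D(t,0)x(0)]_i + \sum_{s=1}^{t} [D(t,s)\epsilon(s)]_i}{\phi_i(t)\,n + [D(t,0)\mathbf{1}]_i}. \]

Therefore,

\[
\begin{aligned}
z_i(t+1) - \frac{\mathbf{1}'x(t)}{n}
&= \frac{\phi_i(t)\mathbf{1}'x(t) + [D(t,0)x(0)]_i + \sum_{s=1}^{t} [D(t,s)\epsilon(s)]_i}{\phi_i(t)\,n + [D(t,0)\mathbf{1}]_i} - \frac{\mathbf{1}'x(t)}{n} \\
&= \frac{n \left( [D(t,0)x(0)]_i + \sum_{s=1}^{t} [D(t,s)\epsilon(s)]_i \right) - \mathbf{1}'x(t)\,[D(t,0)\mathbf{1}]_i}{n \left( \phi_i(t)\,n + [D(t,0)\mathbf{1}]_i \right)}.
\end{aligned}
\]

Observe that the denominator of the above fraction is $n$ times the $i$'th row sum of $A(t : 0)$.
By definition of $\delta$, this row sum is at least $\delta$, and consequently

\[ n \left( \phi_i(t)\,n + [D(t,0)\mathbf{1}]_i \right) \ge n\delta. \]

Thus,

\[
\begin{aligned}
\left| z_i(t+1) - \frac{\mathbf{1}'x(t)}{n} \right|
&\le \frac{\left| n \left( [D(t,0)x(0)]_i + \sum_{s=1}^{t} [D(t,s)\epsilon(s)]_i \right) \right|}{n \left| \phi_i(t)\,n + [D(t,0)\mathbf{1}]_i \right|}
 + \frac{\left| \mathbf{1}'x(t)\,[D(t,0)\mathbf{1}]_i \right|}{n \left| \phi_i(t)\,n + [D(t,0)\mathbf{1}]_i \right|} \\
&\le \frac{1}{n\delta} \left( n \left( \max_j \left| [D(t,0)]_{ij} \right| \right) \|x(0)\|_1 + n \sum_{s=1}^{t} \left( \max_j \left| [D(t,s)]_{ij} \right| \right) \|\epsilon(s)\|_1
 + |\mathbf{1}'x(t)| \left( \max_j \left| [D(t,0)]_{ij} \right| \right) n \right).
\end{aligned}
\]

Factoring an $n$ out and using estimates for $|[D(t,s)]_{ij}|$ as given in (10), we find

\[ \left| z_i(t+1) - \frac{\mathbf{1}'x(t)}{n} \right| \le \frac{C}{\delta} \left( \lambda^t \|x(0)\|_1 + \sum_{s=1}^{t} \lambda^{t-s} \|\epsilon(s)\|_1 + |\mathbf{1}'x(t)|\,\lambda^t \right). \]

Now we look at the term $\mathbf{1}'x(t)$. From Eq. (8), we have

\[ |\mathbf{1}'x(t)| \le \|x(0)\|_1 + \sum_{s=1}^{t} \|\epsilon(s)\|_1. \]

Going back to our estimate,

\[
\begin{aligned}
\left| z_i(t+1) - \frac{\mathbf{1}'x(t)}{n} \right|
&\le \frac{C}{\delta} \left( \lambda^t \|x(0)\|_1 + \sum_{s=1}^{t} \lambda^{t-s} \|\epsilon(s)\|_1 + \lambda^t \left( \|x(0)\|_1 + \sum_{s=1}^{t} \|\epsilon(s)\|_1 \right) \right) \\
&\le \frac{C}{\delta} \left( 2\lambda^t \|x(0)\|_1 + 2 \sum_{s=1}^{t} \lambda^{t-s} \|\epsilon(s)\|_1 \right).
\end{aligned}
\]

Since we were able to choose $C \le 4$ in all the cases considered in Corollary 8, we may choose
$C = 4$ to obtain a proof of part (a).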
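(b) By letting $t \to \infty$ in the preceding relation, since $\lambda \in (0,1)$, we find that

\[ \lim_{t \to \infty} \left| z_i(t+1) - \frac{\mathbf{1}'x(t)}{n} \right| \le \frac{2C}{\delta} \lim_{t \to \infty} \sum_{s=1}^{t} \lambda^{t-s} \|\epsilon(s)\|_1. \]

When $\epsilon_i(t) \to 0$ for all $i$, then $\|\epsilon(t)\|_1 \to 0$ and, by Lemma 9(a), we conclude that

\[ \lim_{t \to \infty} \sum_{s=1}^{t} \lambda^{t-s} \|\epsilon(s)\|_1 = 0, \]

and the result follows from the preceding two relations.

(c) Since $\{\alpha(t)\}$ is a positive and non-increasing sequence, we have $\alpha(t+1) \le \alpha(0)$ and
$\alpha(t+1) \le \alpha(s)$ for all $s \le t$. Using these relations, we obtain

\[ \alpha(t+1) \left| z_i(t+1) - \frac{\mathbf{1}'x(t)}{n} \right| \le \frac{2C}{\delta} \left( \alpha(0)\,\lambda^t \|x(0)\|_1 + \sum_{s=1}^{t} \lambda^{t-s} \alpha(s) \|\epsilon(s)\|_1 \right). \qquad (13) \]

Since $\lambda \in (0,1)$, the sum $\sum_{t=1}^{\infty} \lambda^t$ is finite, and by Lemma 9(b) the sum
$\sum_{t=1}^{\infty} \sum_{s=1}^{t} \lambda^{t-s} \alpha(s) \|\epsilon(s)\|_1$ is finite. Therefore

\[ \sum_{t=1}^{\infty} \alpha(t+1) \left| z_i(t+1) - \frac{\mathbf{1}'x(t)}{n} \right| < \infty. \]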



Lemma 3, which we have just proved, is the central result of this section; it states that
each of the sequences $z_i(t+1)$ tracks the average $\bar{x}(t) = \mathbf{1}'x(t)/n$ increasingly well as time
goes on. We will later require a corollary of this lemma: we will need the fact that a weighted
average of the $z_i(t+1)$ tracks a weighted average of $\bar{x}(t)$. The next corollary gives a precise
statement of this.

Corollary 10 Suppose all the assumptions of Lemma 3 are satisfied, and moreover
$\alpha(t) = 1/\sqrt{t}$ and the size of the perturbations $\epsilon_i(t)$ is bounded as $\|\epsilon_i(t)\|_1 \le D/\sqrt{t}$. Defining

\[ \bar{x}(t) = \frac{1}{n} \sum_{i=1}^{n} x_i(t), \]

we then have that for every $i = 1, \ldots, n$,



Proof. Indeed, on the one hand 



J2(^ik + l)\zi{k + l)-x{k)\ <4 



||x(0)||i + D(l + lnt) 
6{1- X){VtT2-l)' 



* rt+i I , 

Va(A; + l)> / rh, = 9 ( + 9 - 1 



(14) 
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On the other hand using Lemma [3l 

A 



Y,c^{k + l)\m-^^{k + l)\ < ^5^^=||a:(0)||i + ^5^a(A; + l)5^A^-^||6(s)|K 



8 1 ,, / , I, 8/^ V — -\ V — -\ A 

fc=l s=l 



8 „ ,^,„ 8D(l + lnt) ,^ , 



We conclude this section by remarking that Lemma 3 and Corollary 10 hold even if $x_i(t)$ (and, by extension, $z_i(t)$) is a $d$-dimensional vector, by applying the results to each coordinate component of the space.
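
The two elementary estimates behind Corollary 10, namely the integral lower bound $\sum_{k=0}^{t} 1/\sqrt{k+1} \ge 2(\sqrt{t+2} - 1)$ and the harmonic bound $\sum_{s=1}^{t} 1/s \le 1 + \ln t$, are easy to sanity-check numerically. The short Python sketch below does this for a few values of $t$; the function names are hypothetical, and the check is an illustration rather than a proof.

```python
import math

def step_sum(t):
    """Sum of alpha(k+1) = 1/sqrt(k+1) for k = 0, ..., t."""
    return sum(1.0 / math.sqrt(k + 1) for k in range(t + 1))

def harmonic(t):
    """Sum of 1/s for s = 1, ..., t."""
    return sum(1.0 / s for s in range(1, t + 1))

checks = []
for t in (1, 10, 100, 1000, 10000):
    lower = 2.0 * (math.sqrt(t + 2) - 1.0)            # integral lower bound
    upper = 1.0 + math.log(t)                          # harmonic upper bound
    checks.append((t, step_sum(t) >= lower, harmonic(t) <= upper))

for t, integral_ok, harmonic_ok in checks:
    print(t, integral_ok, harmonic_ok)
```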



4 Convergence Results for Subgradient-Push Method 

We turn now to the proofs of our main results, Theorems 1 and 2. Our arguments will crucially rely on the convergence results for the perturbed push-sum method we have established in the previous section.

We give a brief, informal summary of the main ideas behind our argument. The convergence result for the perturbed push-sum method of the previous section implies that, under the appropriate assumptions, the entries of $z_\ell(t)$ (the vector collecting the $\ell$'th entries of the estimates held by the nodes) get close to each other over time, and consequently $z_\ell(t)$ approaches a multiple of the all-ones vector. Thus every node takes a subgradient of its own function $f_i$ at nearly the same point; over time, these subgradients are averaged by the push-sum-like updates of our method, and the subgradient-push approximates the ordinary subgradient algorithm applied to the average function $\frac{1}{n} \sum_{j=1}^{n} f_j$.
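
The mechanism just described can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation. It uses scalar states, the hypothetical test functions $f_i(z) = |z - c_i|$ (whose sum is minimized at the median of the $c_i$), a hypothetical time-varying directed graph, and the step size $\alpha(t) = t^{-3/4}$, chosen only because it satisfies $\sum_t \alpha(t) = \infty$ and $\sum_t \alpha^2(t) < \infty$.

```python
def subgradient_push(c, steps=20000):
    """Scalar subgradient-push sketch for f_i(z) = |z - c[i]| (assumed test problem)."""
    n = len(c)
    x = [0.0] * n
    y = [1.0] * n
    z = [0.0] * n
    for t in range(steps):
        # Assumed graph sequence: directed ring with self-loops, plus an extra
        # edge 0 -> 2 on even rounds. Each node only uses its own out-degree.
        out_neighbors = [[j, (j + 1) % n] for j in range(n)]
        if t % 2 == 0:
            out_neighbors[0].append(2)
        w = [0.0] * n
        y_next = [0.0] * n
        for j in range(n):
            d_j = len(out_neighbors[j])
            for i in out_neighbors[j]:
                w[i] += x[j] / d_j           # push-sum mixing of the x-values
                y_next[i] += y[j] / d_j      # push-sum mixing of the weights
        y = y_next
        z = [w[i] / y[i] for i in range(n)]  # de-biased estimate at each node
        alpha = (t + 1) ** -0.75             # alpha(t+1), assumed step-size rule
        # Subgradient of |z - c_i| at z_i: sign(z_i - c_i), taken as 0 at the kink.
        g = [(z[i] > c[i]) - (z[i] < c[i]) for i in range(n)]
        x = [w[i] - alpha * g[i] for i in range(n)]
    return z

c = [1.0, 2.0, 3.0, 4.0, 10.0]               # the median, 3.0, minimizes sum_i |z - c_i|
z_final = subgradient_push(c)
print(z_final)
```

Every node ends up near the same point even though no node knows the graph or the other nodes' functions; only out-degrees are used when splitting values among out-neighbors.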

We now begin the formal process of proving Theorems 1 and 2. Our first step will be to establish two lemmas pertaining to the convergence of a scalar sequence. The first is a deterministic counterpart of the well-known supermartingale convergence result ([26]; see also [21], Lemma 11, Chapter 2.2).

Lemma 11 Let $\{v_t\}$ be a non-negative scalar sequence such that

$$v_{t+1} \le (1 + b_t) v_t - u_t + c_t \quad \text{for all } t \ge 0,$$

where $b_t \ge 0$, $u_t \ge 0$ and $c_t \ge 0$ for all $t$, with $\sum_{t=0}^{\infty} b_t < \infty$ and $\sum_{t=0}^{\infty} c_t < \infty$. Then the sequence $\{v_t\}$ converges to some $v \ge 0$ and $\sum_{t=0}^{\infty} u_t < \infty$.

We can use this lemma to derive the convergence of a sequence satisfying a subgradient-like recursion, as in the following lemma.

Lemma 12 Consider a convex minimization problem $\min_{x \in X} f(x)$, where $f : \mathbb{R}^m \to \mathbb{R}$ is a convex function and $X \subseteq \mathbb{R}^m$ is a convex closed set. Assume that the solution set $X^*$ of the problem is nonempty. Let $\{x_t\}$ be a sequence such that for all $x \in X$,

$$\|x_{t+1} - x\|^2 \le (1 + b_t) \|x_t - x\|^2 - \alpha_t \left( f(x_t) - f(x) \right) + c_t \quad \text{for all } t \ge 0,$$

where $b_t \ge 0$, $\alpha_t \ge 0$ and $c_t \ge 0$ for all $t$, with $\sum_{t=0}^{\infty} b_t < \infty$, $\sum_{t=0}^{\infty} \alpha_t = \infty$ and $\sum_{t=0}^{\infty} c_t < \infty$. Then, the sequence $\{x_t\}$ converges to some solution $x^*$ of the problem.

Proof. By letting $x = x^*$ for arbitrary $x^* \in X^*$, we obtain

$$\|x_{t+1} - x^*\|^2 \le (1 + b_t) \|x_t - x^*\|^2 - \alpha_t \left( f(x_t) - f(x^*) \right) + c_t \quad \text{for all } t \ge 0.$$

Thus, all the conditions of Lemma 11 are satisfied, and by this lemma we obtain the following statements:

$$\text{the sequence } \{\|x_t - x^*\|^2\} \text{ is convergent for every } x^* \in X^*, \tag{16}$$

$$\sum_{t=0}^{\infty} \alpha_t \left( f(x_t) - f^* \right) < \infty, \tag{17}$$

where $f^* = \min_{x \in X} f(x)$. Since $\sum_{t=0}^{\infty} \alpha_t = \infty$, it follows from (17) that

$$\liminf_{t \to \infty} f(x_t) = f^*.$$

Let $\{x_{t_\ell}\}$ be a subsequence of $\{x_t\}$ such that

$$\lim_{\ell \to \infty} f(x_{t_\ell}) = \liminf_{t \to \infty} f(x_t) = f^*. \tag{18}$$

Now Eq. (16) implies that the sequence $\{x_t\}$ takes values in a compact set, thus we can assume without loss of generality that $\{x_{t_\ell}\}$ is converging to some $\tilde{x}$ (for otherwise, we can in turn select a convergent subsequence of $\{x_{t_\ell}\}$). The function $f$ is convex over $\mathbb{R}^m$, so it is continuous. Therefore,

$$\lim_{\ell \to \infty} f(x_{t_\ell}) = f(\tilde{x}),$$

which by (18) implies that $\tilde{x} \in X^*$. By letting $x^* = \tilde{x}$ in (16) we obtain that the whole sequence $\{x_t\}$ must converge to $\tilde{x}$. ■
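
As a concrete instance of such a recursion, consider the ordinary (centralized) subgradient method on the hypothetical scalar function $f(x) = |x - 3|$ with step $1/t$. Expanding the square shows that it satisfies the inequality of Lemma 12 with $b_t = 0$, $\alpha_t = 2/t$ and $c_t = 1/t^2$. The Python sketch below checks the inequality at $x = x^* = 3$ along the trajectory and prints the final iterate; it is an illustration under these assumed choices, not part of the proof.

```python
def f(x):
    return abs(x - 3.0)

def subgrad(x):
    # A subgradient of |x - 3|: the sign of x - 3, taken as 0 at the kink.
    return (x > 3.0) - (x < 3.0)

x, x_star = -5.0, 3.0
recursion_holds = True
for t in range(1, 5001):
    step = 1.0 / t
    x_new = x - step * subgrad(x)
    # Right-hand side of the recursion with b_t = 0, alpha_t = 2*step,
    # c_t = step^2 * L^2 and subgradient bound L = 1.
    bound = (x - x_star) ** 2 - 2.0 * step * (f(x) - f(x_star)) + step ** 2
    recursion_holds = recursion_holds and ((x_new - x_star) ** 2 <= bound + 1e-12)
    x = x_new

print(recursion_holds, x)
```

Here $\sum_t \alpha_t = \infty$ and $\sum_t c_t < \infty$, so the lemma predicts convergence of the iterates to the minimizer.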

A key step in the proofs of Theorems 1 and 2 will be obtained by applying the previous lemma to the average-value process

$$\bar{\mathbf{x}}(t) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_i(t),$$

where $\mathbf{x}_i(t)$ is the sequence generated by the subgradient-push method. We will need to argue that $\bar{\mathbf{x}}(t)$ satisfies the assumptions of the previous lemma, for which the following statement will be instrumental.

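Lemma 13 Under the same assumptions as in Theorem 1, we have for all $\mathbf{v} \in \mathbb{R}^d$ and $t \ge 0$:

$$\|\bar{\mathbf{x}}(t+1) - \mathbf{v}\|^2 \le \|\bar{\mathbf{x}}(t) - \mathbf{v}\|^2 - \frac{2\alpha(t+1)}{n} \left( F(\bar{\mathbf{x}}(t)) - F(\mathbf{v}) \right) + \frac{4\alpha(t+1)}{n} \sum_{i=1}^{n} L_i \|\mathbf{z}_i(t+1) - \bar{\mathbf{x}}(t)\| + \alpha^2(t+1) \frac{\left( \sum_{i=1}^{n} L_i \right)^2}{n^2}.$$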
Proof. Let us define $x_\ell(t)$ to be the vector in $\mathbb{R}^n$ which stacks up the $\ell$'th entries of all the vectors $\mathbf{x}_i(t)$: formally, we define $x_\ell(t)$ to be the vector whose $j$'th entry is the $\ell$'th entry of $\mathbf{x}_j(t)$. Similarly, we define $g_\ell(t)$ to be the vector stacking up the $\ell$'th entries of the vectors $\mathbf{g}_i(t)$: the $j$'th entry of $g_\ell(t)$ is the $\ell$'th entry of $\mathbf{g}_j(t)$.

It is easy to see from the definition of the subgradient-push in Eq. (1) that

$$x_\ell(t+1) = A(t) x_\ell(t) - \alpha(t+1) g_\ell(t+1) \quad \text{for } \ell = 1, \ldots, d.$$

Since $A(t)$ is a column-stochastic matrix, this implies for all $\ell = 1, \ldots, d$,

$$\frac{1}{n} \sum_{j=1}^{n} [x_\ell(t+1)]_j = \frac{1}{n} \sum_{j=1}^{n} [x_\ell(t)]_j - \frac{\alpha(t+1)}{n} \sum_{j=1}^{n} [g_\ell(t+1)]_j.$$

Since the $\ell$'th entry of $\bar{\mathbf{x}}(t+1)$ is exactly the left-hand side above, we can conclude that

$$\bar{\mathbf{x}}(t+1) = \bar{\mathbf{x}}(t) - \frac{\alpha(t+1)}{n} \sum_{j=1}^{n} \mathbf{g}_j(t+1). \tag{19}$$
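
Relation (19) rests only on the fact that multiplying by a column-stochastic matrix preserves the sum of the entries. The Python sketch below checks this on one coordinate for a small hypothetical graph; the graph, the numerical values and the helper names are assumptions made for the illustration.

```python
def column_stochastic_from_graph(out_neighbors):
    """A[i][j] = 1/d_j if j sends to i (self-loops included), else 0."""
    n = len(out_neighbors)
    A = [[0.0] * n for _ in range(n)]
    for j, targets in enumerate(out_neighbors):
        for i in targets:
            A[i][j] = 1.0 / len(targets)     # column j sums to 1
    return A

def matvec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]

out_neighbors = [[0, 1, 2], [1, 2], [2, 3], [3, 0]]   # assumed directed graph
A = column_stochastic_from_graph(out_neighbors)
x = [4.0, -1.0, 2.5, 0.5]       # one coordinate of the node states
g = [1.0, -1.0, 1.0, 1.0]       # the matching coordinate of the subgradients
alpha = 0.3
n = len(x)

x_next = [w_i - alpha * g_i for w_i, g_i in zip(matvec(A, x), g)]
lhs = sum(x_next) / n                        # average after the update
rhs = sum(x) / n - alpha * sum(g) / n        # right-hand side of relation (19)
print(lhs, rhs)
```

The two printed numbers agree up to rounding, even though the matrix is not doubly stochastic and the individual entries of `x_next` are far from their average.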

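Now let $\mathbf{v} \in \mathbb{R}^d$ be an arbitrary vector. From relation (19) it follows that for all $t \ge 0$,

$$\|\bar{\mathbf{x}}(t+1) - \mathbf{v}\|^2 = \|\bar{\mathbf{x}}(t) - \mathbf{v}\|^2 - \frac{2\alpha(t+1)}{n} \sum_{i=1}^{n} \mathbf{g}_i'(t+1) (\bar{\mathbf{x}}(t) - \mathbf{v}) + \frac{\alpha^2(t+1)}{n^2} \left\| \sum_{i=1}^{n} \mathbf{g}_i(t+1) \right\|^2.$$

Since the subgradient norms of each $f_i$ are uniformly bounded by $L_i$, it further follows that for all $t \ge 0$,

$$\|\bar{\mathbf{x}}(t+1) - \mathbf{v}\|^2 \le \|\bar{\mathbf{x}}(t) - \mathbf{v}\|^2 - \frac{2\alpha(t+1)}{n} \sum_{i=1}^{n} \mathbf{g}_i'(t+1) (\bar{\mathbf{x}}(t) - \mathbf{v}) + \frac{\alpha^2(t+1)}{n^2} \|L\|_1^2, \tag{20}$$

where $L = (L_1, \ldots, L_n)'$, so that $\|L\|_1 = \sum_{i=1}^{n} L_i$.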

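We next consider the cross-term $\sum_{i=1}^{n} \mathbf{g}_i'(t+1) (\bar{\mathbf{x}}(t) - \mathbf{v})$ in (20). For this term, we write

$$\sum_{i=1}^{n} \mathbf{g}_i'(t+1) (\bar{\mathbf{x}}(t) - \mathbf{v}) = \sum_{i=1}^{n} \mathbf{g}_i'(t+1) \left( (\bar{\mathbf{x}}(t) - \mathbf{z}_i(t+1)) + (\mathbf{z}_i(t+1) - \mathbf{v}) \right). \tag{21}$$

Using the subgradient boundedness and the Cauchy-Schwarz inequality, we can lower bound the first term $\mathbf{g}_i'(t+1) (\bar{\mathbf{x}}(t) - \mathbf{z}_i(t+1))$ as

$$\mathbf{g}_i'(t+1) (\bar{\mathbf{x}}(t) - \mathbf{z}_i(t+1)) \ge -L_i \|\bar{\mathbf{x}}(t) - \mathbf{z}_i(t+1)\|. \tag{22}$$

As for the second term $\mathbf{g}_i'(t+1) (\mathbf{z}_i(t+1) - \mathbf{v})$, we can use the fact that $\mathbf{g}_i(t+1)$ is a subgradient of $f_i(\theta)$ at $\theta = \mathbf{z}_i(t+1)$ to obtain

$$\mathbf{g}_i'(t+1) (\mathbf{z}_i(t+1) - \mathbf{v}) \ge f_i(\mathbf{z}_i(t+1)) - f_i(\mathbf{v}),$$

from which, by adding and subtracting $f_i(\bar{\mathbf{x}}(t))$ and using the Lipschitz continuity of $f_i$ (implied by the subgradient boundedness), we further obtain

$$\mathbf{g}_i'(t+1) (\mathbf{z}_i(t+1) - \mathbf{v}) \ge f_i(\mathbf{z}_i(t+1)) - f_i(\bar{\mathbf{x}}(t)) + f_i(\bar{\mathbf{x}}(t)) - f_i(\mathbf{v})$$

$$\ge -L_i \|\mathbf{z}_i(t+1) - \bar{\mathbf{x}}(t)\| + f_i(\bar{\mathbf{x}}(t)) - f_i(\mathbf{v}). \tag{23}$$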



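By substituting the estimates obtained in (22)-(23) back in relation (21), and using $F(\mathbf{x}) = \sum_{i=1}^{n} f_i(\mathbf{x})$, we obtain

$$\sum_{i=1}^{n} \mathbf{g}_i'(t+1) (\bar{\mathbf{x}}(t) - \mathbf{v}) \ge F(\bar{\mathbf{x}}(t)) - F(\mathbf{v}) - 2 \sum_{i=1}^{n} L_i \|\mathbf{z}_i(t+1) - \bar{\mathbf{x}}(t)\|. \tag{24}$$

Now, we substitute estimate (24) into relation (20) and obtain for any $\mathbf{v} \in \mathbb{R}^d$ and all $t \ge 0$,

$$\|\bar{\mathbf{x}}(t+1) - \mathbf{v}\|^2 \le \|\bar{\mathbf{x}}(t) - \mathbf{v}\|^2 - \frac{2\alpha(t+1)}{n} \left( F(\bar{\mathbf{x}}(t)) - F(\mathbf{v}) \right) + \frac{4\alpha(t+1)}{n} \sum_{i=1}^{n} L_i \|\mathbf{z}_i(t+1) - \bar{\mathbf{x}}(t)\| + \alpha^2(t+1) \frac{\left( \sum_{i=1}^{n} L_i \right)^2}{n^2}.$$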



With all the pieces in place, we are finally ready to prove Theorem 1. The proof idea is to show that the averages $\bar{\mathbf{x}}(t)$, as defined in Lemma 13, converge to some solution $\mathbf{x}^* \in X^*$ and then show that $\mathbf{z}_i(t+1) - \bar{\mathbf{x}}(t)$ converges to $0$ for all $i$, as $t \to \infty$. The last step will be accomplished by invoking Lemma 3 on the perturbed push-sum protocol.

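Proof of Theorem 1. We begin by observing that the subgradient-push method may be viewed as an instance of the perturbed push-sum protocol. Indeed, let us adopt the notation $x_\ell(t), g_\ell(t)$ from the proof of Lemma 13, and moreover let us define $w_\ell(t), z_\ell(t)$ identically. Then, the definition of subgradient-push implies that for all $\ell = 1, \ldots, d$,

$$\begin{aligned}
w_\ell(t+1) &= A(t) x_\ell(t), \\
y(t+1) &= A(t) y(t), \\
[z_\ell(t+1)]_i &= \frac{[w_\ell(t+1)]_i}{y_i(t+1)} \quad \text{for all } i, \\
x_\ell(t+1) &= w_\ell(t+1) - \alpha(t+1) g_\ell(t+1).
\end{aligned}$$

Since $\alpha(t) \to 0$ and the subgradients are bounded, the assumptions of Lemma 3 are satisfied with $\epsilon_\ell(t+1) = \alpha(t+1) g_\ell(t+1)$, and from Lemma 3(b) we obtain the conclusion

$$\lim_{t \to \infty} \left| [z_\ell(t+1)]_i - \frac{\sum_{j=1}^{n} [x_\ell(t)]_j}{n} \right| = 0 \quad \text{for all } \ell = 1, \ldots, d \text{ and all } i = 1, \ldots, n,$$

which is equivalent to

$$\lim_{t \to \infty} \|\mathbf{z}_i(t+1) - \bar{\mathbf{x}}(t)\| = 0 \quad \text{for all } i = 1, \ldots, n. \tag{25}$$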



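Next, we apply Lemma 3(c). Since the subgradients $\mathbf{g}_i(s)$ are uniformly bounded, and $\{\alpha(t)\}$ is non-increasing and such that $\sum_{t=1}^{\infty} \alpha^2(t) < \infty$, from $\epsilon_\ell(t+1) = \alpha(t+1) g_\ell(t+1)$ it follows that for all $i = 1, \ldots, n$ and $\ell = 1, \ldots, d$,

$$\sum_{t=1}^{\infty} \alpha(t) \left| [\epsilon_\ell(t+1)]_i \right| \le \sum_{t=1}^{\infty} \alpha(t) \alpha(t+1) \|\mathbf{g}_i(t+1)\|_1 \le \sum_{t=1}^{\infty} \alpha^2(t) \|\mathbf{g}_i(t+1)\|_1 < \infty.$$

In view of the preceding relation and the assumption that the sequence $\{\alpha(t)\}$ is non-increasing, by applying Lemma 3(c) to each coordinate $\ell = 1, \ldots, d$, we obtain

$$\sum_{t=0}^{\infty} \alpha(t+1) \left| [z_\ell(t+1)]_i - \frac{\sum_{j=1}^{n} [x_\ell(t)]_j}{n} \right| < \infty \quad \text{for all } \ell = 1, \ldots, d \text{ and all } i = 1, \ldots, n,$$

which implies that

$$\sum_{t=0}^{\infty} \alpha(t+1) \|\mathbf{z}_i(t+1) - \bar{\mathbf{x}}(t)\| < \infty \quad \text{for all } i = 1, \ldots, n. \tag{26}$$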

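Next we consider Lemma 13, where we use $\mathbf{v} = \mathbf{z}^*$ for some solution $\mathbf{z}^* \in Z^*$,

$$\|\bar{\mathbf{x}}(t+1) - \mathbf{z}^*\|^2 \le \|\bar{\mathbf{x}}(t) - \mathbf{z}^*\|^2 - \frac{2\alpha(t+1)}{n} \left( F(\bar{\mathbf{x}}(t)) - F^* \right) + \frac{4\alpha(t+1)}{n} \sum_{i=1}^{n} L_i \|\mathbf{z}_i(t+1) - \bar{\mathbf{x}}(t)\| + \alpha^2(t+1) \frac{\left( \sum_{i=1}^{n} L_i \right)^2}{n^2}, \tag{27}$$

where $F^*$ is the optimal value (i.e., $F^* = F(\mathbf{z}^*)$ for any $\mathbf{z}^* \in Z^*$). In view of Eq. (26), it follows that

$$\sum_{t=0}^{\infty} \frac{4\alpha(t+1)}{n} \sum_{i=1}^{n} L_i \|\mathbf{z}_i(t+1) - \bar{\mathbf{x}}(t)\| < \infty.$$

Also, by assumption we have that $\sum_{t=1}^{\infty} \alpha(t) = \infty$ and $\sum_{t=1}^{\infty} \alpha^2(t) < \infty$. Thus, all the conditions of Lemma 12 are satisfied, and by this lemma we conclude that the average sequence $\{\bar{\mathbf{x}}(t)\}$ must converge to some solution $\hat{\mathbf{z}} \in Z^*$. By relation (25) it follows that the sequence $\{\mathbf{z}_i(t)\}$, $i = 1, \ldots, n$, also converges to the solution $\hat{\mathbf{z}}$. ■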

Having proven Theorem 1, we now turn to the proof of the convergence rate results of Theorem 2. The first step will be a slight modification of the result we have just proved: whereas the proof of Theorem 1 shows that $F(\bar{\mathbf{x}}(t)) \to F(\mathbf{z}^*)$, we will now need to argue that we can replace $F(\bar{\mathbf{x}}(t))$ by $F$ evaluated at a running average of the vectors $\bar{\mathbf{x}}(t)$. This is stated precisely in the next lemma.



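Lemma 14 If all the assumptions of Theorem 1 are satisfied and $\alpha(t) = 1/\sqrt{t}$, then for all $t \ge 1$,

$$F\left( \frac{\sum_{k=0}^{t} \alpha(k+1) \bar{\mathbf{x}}(k)}{\sum_{k=0}^{t} \alpha(k+1)} \right) - F(\mathbf{z}^*) \le \frac{n \|\bar{\mathbf{x}}(0) - \mathbf{z}^*\|_1}{4(\sqrt{t+2} - 1)} + \frac{\left( \sum_{i=1}^{n} L_i \right)^2 (1 + \ln(t+1))}{4n(\sqrt{t+2} - 1)}$$

$$+ \frac{8 \left[ \left( \sum_{i=1}^{n} L_i \right) \left( \sum_{j=1}^{n} \|\mathbf{x}_j(0)\|_1 \right) + \left( \sum_{i=1}^{n} L_i^2 \right) (1 + \ln t) \right]}{\delta (1 - \lambda)(\sqrt{t+2} - 1)}.$$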

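Proof. From Lemma 13 we have for any $\mathbf{v} \in \mathbb{R}^d$,

$$\frac{2}{n} \sum_{k=0}^{t} \alpha(k+1) \left( F(\bar{\mathbf{x}}(k)) - F(\mathbf{v}) \right) \le \|\bar{\mathbf{x}}(0) - \mathbf{v}\|^2 + \frac{4}{n} \sum_{k=0}^{t} \alpha(k+1) \sum_{i=1}^{n} L_i \|\mathbf{z}_i(k+1) - \bar{\mathbf{x}}(k)\| + \frac{\left( \sum_{i=1}^{n} L_i \right)^2}{n^2} \sum_{k=0}^{t} \alpha^2(k+1).$$

From the preceding relation, recalling the notation $S(t) = \sum_{k=0}^{t-1} \alpha(k+1)$ and dividing by $(2/n) S(t+1)$, we obtain for any $\mathbf{v} \in \mathbb{R}^d$,

$$\frac{\sum_{k=0}^{t} \alpha(k+1) F(\bar{\mathbf{x}}(k))}{S(t+1)} - F(\mathbf{v}) \le \frac{n}{2} \frac{\|\bar{\mathbf{x}}(0) - \mathbf{v}\|^2}{S(t+1)} + \frac{2}{S(t+1)} \sum_{k=0}^{t} \alpha(k+1) \sum_{i=1}^{n} L_i \|\mathbf{z}_i(k+1) - \bar{\mathbf{x}}(k)\| + \frac{\left( \sum_{i=1}^{n} L_i \right)^2}{2n S(t+1)} \sum_{k=0}^{t} \alpha^2(k+1).$$

Now setting $\mathbf{v} = \mathbf{z}^*$ for some $\mathbf{z}^* \in Z^*$ and using the convexity of $F$, we obtain

$$F\left( \frac{\sum_{k=0}^{t} \alpha(k+1) \bar{\mathbf{x}}(k)}{S(t+1)} \right) - F(\mathbf{z}^*) \le \frac{\sum_{k=0}^{t} \alpha(k+1) F(\bar{\mathbf{x}}(k))}{S(t+1)} - F(\mathbf{z}^*)$$

$$\le \frac{n}{2} \frac{\|\bar{\mathbf{x}}(0) - \mathbf{z}^*\|^2}{S(t+1)} + \frac{2}{S(t+1)} \sum_{k=0}^{t} \alpha(k+1) \sum_{i=1}^{n} L_i \|\mathbf{z}_i(k+1) - \bar{\mathbf{x}}(k)\| + \frac{\left( \sum_{i=1}^{n} L_i \right)^2}{2n S(t+1)} \sum_{k=0}^{t} \alpha^2(k+1). \tag{28}$$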



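We now bound the quantity $\|\mathbf{z}_i(t+1) - \bar{\mathbf{x}}(t)\|$ in Eq. (28) by applying Corollary 10 to each component of it; specifically, we will apply Eq. (15) derived in the proof of that corollary. Indeed, observe that all the assumptions of that corollary have been assumed to hold; moreover, the constant $D$ in the statement of that corollary equals $L_i$ when the corollary is applied to the $\ell$'th component of $\mathbf{z}_i(t+1) - \bar{\mathbf{x}}(t)$. Therefore, adopting our notation of $x_\ell(t)$ and $z_\ell(t)$ as the vectors that stack up the $\ell$'th components of the vectors $\mathbf{x}_j(t), \mathbf{z}_j(t)$, we get that for all $\ell = 1, \ldots, d$ and $i = 1, \ldots, n$,

$$\sum_{k=0}^{t} \alpha(k+1) \left| [z_\ell(k+1)]_i - \frac{1}{n} \sum_{j=1}^{n} [x_\ell(k)]_j \right| \le \frac{8}{\delta(1 - \lambda)} \|x_\ell(0)\|_1 + \frac{8}{\delta} \cdot \frac{L_i (1 + \ln t)}{1 - \lambda},$$

which after some elementary algebra implies that

$$\sum_{k=0}^{t} \alpha(k+1) \sum_{i=1}^{n} L_i \|\mathbf{z}_i(k+1) - \bar{\mathbf{x}}(k)\|_1 \le \frac{8 \left[ \left( \sum_{i=1}^{n} L_i \right) \left( \sum_{j=1}^{n} \|\mathbf{x}_j(0)\|_1 \right) + \left( \sum_{i=1}^{n} L_i^2 \right) (1 + \ln t) \right]}{\delta (1 - \lambda)}. \tag{29}$$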

Now, using the fact that the Euclidean norm of a vector is not more than its 1-norm, we substitute Eq. (29) into Eq. (28). Then, using the definition of $S(t)$ and Eq. (14), we bound the denominator in Eq. (28) as follows:

$$S(t+1) = \sum_{k=0}^{t} \alpha(k+1) \ge 2\left(\sqrt{t+2} - 1\right).$$

The preceding relation and

$$\sum_{k=0}^{t} \alpha^2(k+1) = \sum_{s=1}^{t+1} \frac{1}{s} \le 1 + \ln(t+1)$$

yield the stated estimate. ■

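We are now finally in position to prove Theorem 2. At this point, the proof is a simple combination of Lemma 14, which tells us that $F\left( \frac{\sum_{k=0}^{t} \alpha(k+1) \bar{\mathbf{x}}(k)}{\sum_{k=0}^{t} \alpha(k+1)} \right)$ approaches $F(\mathbf{z}^*)$, along with Corollary 10, which tells us that $\frac{\sum_{k=0}^{t} \alpha(k+1) \bar{\mathbf{x}}(k)}{\sum_{k=0}^{t} \alpha(k+1)}$ and $\frac{\sum_{k=0}^{t} \alpha(k+1) \mathbf{z}_i(k+1)}{\sum_{k=0}^{t} \alpha(k+1)}$ get close to each other over time for all $i = 1, \ldots, n$.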

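Proof of Theorem 2. It is easy to see by induction that the vectors $\tilde{\mathbf{z}}_i(t)$ defined in the statement of Theorem 2 are equal to

$$\tilde{\mathbf{z}}_i(t+1) = \frac{\sum_{k=0}^{t} \alpha(k+1) \mathbf{z}_i(k+1)}{\sum_{k=0}^{t} \alpha(k+1)}.$$

By the boundedness of subgradients and Corollary 10, we obtain that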



ELo"(* + l) / A(l-A)(v/(T2-1) 
Finally, Lemma 14 along with the inequality $2(\sqrt{t+2} - 1) \ge \sqrt{t}$ now implies Theorem 2.
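
The induction claim at the start of this proof can also be checked numerically. The Python sketch below assumes that the recursion in the statement of Theorem 2 has the form $\tilde{z}(t+1) = \left( \alpha(t+1) z(t+1) + S(t) \tilde{z}(t) \right) / S(t+1)$ with $S(0) = 0$ and $S(t+1) = S(t) + \alpha(t+1)$, and compares it with the directly computed weighted average for $\alpha(t) = 1/\sqrt{t}$. The input sequence and the function names are hypothetical.

```python
import math
import random

def running_weighted_average(zs):
    """Recursive form: a node stores only its previous average and S(t)."""
    z_tilde, S = 0.0, 0.0
    for k, z in enumerate(zs):               # z plays the role of z_i(k+1)
        alpha = 1.0 / math.sqrt(k + 1)       # alpha(k+1)
        S_next = S + alpha                   # S(k+1) = S(k) + alpha(k+1)
        z_tilde = (alpha * z + S * z_tilde) / S_next
        S = S_next
    return z_tilde

def direct_weighted_average(zs):
    """Closed form: sum_k alpha(k+1) z(k+1) / sum_k alpha(k+1)."""
    weights = [1.0 / math.sqrt(k + 1) for k in range(len(zs))]
    return sum(w * z for w, z in zip(weights, zs)) / sum(weights)

rng = random.Random(1)
zs = [rng.uniform(-5.0, 5.0) for _ in range(200)]   # stand-in for a node's estimates
recursive_value = running_weighted_average(zs)
direct_value = direct_weighted_average(zs)
print(recursive_value, direct_value)
```

The recursive form matters in practice because it lets each node maintain its running average with constant memory, without storing past iterates.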



5 Conclusions 

We have introduced the subgradient-push, a broadcast-based distributed protocol for distributed optimization of a sum of convex functions over directed graphs. We have shown that, as long as the communication graph sequence $\{G(t)\}$ is uniformly strongly connected, the subgradient-push succeeds in driving all the nodes to the same point in the set of optimal solutions. Moreover, the objective function converges at a rate of $O(\ln t / \sqrt{t})$, where the constant depends on the initial vectors, bounds on subgradient norms, the consensus speed $\lambda$ of the graph sequence $\{G(t)\}$, as well as a measure of the imbalance of influence $\delta$ among the nodes.

Our results motivate the open problems associated with understanding how the consensus speed $\lambda$ depends on properties of the sequence $\{G(t)\}$. Similarly, it is also natural to ask how the measure of imbalance of influence $\delta$ depends on the combinatorial properties of the graphs $G(t)$, namely how it depends on the diameters, the size of the smallest cuts, and other pertinent features of the graphs in the sequence $\{G(t)\}$.